feat: non-local redundancy #4491

Merged: 24 commits merged into master from feat/redundancy on Feb 8, 2024

Conversation

@nugaon (Member) commented Dec 4, 2023:

Along with neighborhood reserve replication and node caching, Reed-Solomon erasure coding and dispersed replicas make up the redundancy in Bee.

With these new redundancy mechanisms, the client can recover requested data from different neighborhoods of the Kademlia network.

Details are in the PR description.

Open API Spec Version Changes (if applicable)

I think we need to change the version after #4529

@nugaon nugaon marked this pull request as ready for review December 4, 2023 18:32
@nugaon nugaon changed the title from "feat: redundancy" to "feat: kademlia redundancy" on Dec 5, 2023
(Resolved review threads on openapi/Swarm.yaml, openapi/SwarmCommon.yaml, pkg/api/api.go, pkg/api/bzz.go, pkg/api/dirs.go, pkg/replicas/putter.go and pkg/replicas/replicas.go.)
@ldeffenb (Collaborator) left a comment:

Just noting the points that make no sense as I try to understand the upcoming redundancy improvements.

(Resolved review threads on pkg/api/dirs.go and pkg/api/api.go.)
@nugaon nugaon force-pushed the feat/redundancy branch 2 times, most recently from a83b3f7 to a63e161 on December 18, 2023 19:43
@nugaon nugaon changed the title from "feat: kademlia redundancy" to "feat: non-local redundancy" on Dec 19, 2023
(Resolved review thread on pkg/api/dirs.go.)
pkg/api/bzz.go (outdated):
Cache *bool `map:"Swarm-Cache"`
Strategy getter.Strategy `map:"Swarm-Redundancy-Strategy"`
FallbackMode bool `map:"Swarm-Redundancy-Fallback-Mode"`
ChunkRetrievalTimeout time.Duration `map:"Swarm-Chunk-Retrieval-Timeout"`
Member:

what is the purpose of this?

Member:

@istae Configuring this counts as an advanced dev option.
By setting the chunk retrieval timeout to a sufficiently short interval, one can effectively render some slow chunks unretrievable and thus 'simulate' chunk loss even in environments of perfect availability.
This is especially useful since it allows client-side triggering of the redundancy-based recovery of data implemented in this PR.
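For illustration only, a minimal sketch of how a client might exercise this from the outside, using the header names from the pkg/api/bzz.go snippet above. The endpoint path, reference placeholder and header values are assumptions, not part of this PR:

// Minimal sketch (not the PR's code): request a download with a short chunk
// retrieval timeout so that slow chunks look lost and recovery kicks in.
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Placeholder reference and local API address, for illustration only.
    req, err := http.NewRequest(http.MethodGet, "http://localhost:1633/bzz/<reference>/", nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Swarm-Redundancy-Strategy", "1")         // e.g. the DATA strategy
    req.Header.Set("Swarm-Redundancy-Fallback-Mode", "true") // allow falling back to other strategies
    req.Header.Set("Swarm-Chunk-Retrieval-Timeout", "100ms") // short timeout to 'simulate' chunk loss
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, len(body))
}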

ctx = getter.SetStrategy(ctx, headers.Strategy)
ctx = getter.SetStrict(ctx, headers.FallbackMode)
ctx = getter.SetFetchTimeout(ctx, headers.ChunkRetrievalTimeout)
reader, l, err := joiner.New(ctx, s.storer.Download(cache), s.storer.Cache(), reference)
Member:

storer.Cache() is not a reliable source for storing chunks for any operation; what is the purpose of using the cache here?

Member (Author):

what should I put here? There is no other putter available under s.storer.

Member:

what is the purpose of using the cache here?

Member:

so the cache must be used to store things locally that have no postage stamp. There are three ways to get there:

  • evicted from reserve (either too distant, too cheap or expired)
  • landed with us through retrieval and has no postage stamp (yet)
  • and now also created with the help of parities, no postage stamp available (yet).

Member:

Let's get to the bottom of this.

Reconstructed chunks have no postage stamp, therefore they cannot be in the reserve. Putting them in a pinstore is wasteful and unclear in terms of expiry.
The only possible place to put them is the localstore cache.

The status of reconstructed chunks is similar to that of chunks obtained through retrieval, which are also cached.
The only thing we absolutely need to make sure of is that these cached chunks can and will be put in the reserve once they are offered by peers as part of pullsyncing as chunks with a valid postage stamp.

There is one caveat about using the cache: the intricate scenario described and resolved in this HackMD: https://hackmd.io/@zelig/Bkqol64dp

Collaborator:

An even worse consideration is that the cache size is user-specified and can therefore be set to an arbitrarily small value, possibly ensuring that the reconstructed chunks have already been purged before they are needed again.

Member:

I am introducing a recoder accordingly.
Resolved in #4529.

(Resolved review threads on pkg/file/redundancy/level.go.)
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// the code below implements the integration of dispersed replicas in chunk fetching.
Contributor:

Still unresolved.

wg.Wait()
close(errc)
for err := range errc {
    errs = append(errs, err)
Contributor:

Still unresolved.

(Resolved review threads on pkg/replicas/putter.go, pkg/replicas/replicas.go, pkg/file/redundancy/level.go, pkg/api/api.go, openapi/SwarmCommon.yaml, pkg/api/bzz.go, pkg/file/pipeline/hashtrie/hashtrie_test.go and pkg/file/redundancy/getter/getter.go.)
@zelig zelig assigned zelig and nugaon Jan 6, 2024
(Resolved review threads on pkg/file/redundancy/getter/getter.go and pkg/file/redundancy/getter/strategies.go.)
}
defer cancelAll()
run := func(s Strategy) error {
    if s == PROX { // NOT IMPLEMENTED
Member:

what happens if the strategy is DATA? Because the strategy is incremented below, for cases where the strategy is DATA no other strategy is tried, since PROX is not implemented?


var stop <-chan time.Time
if s < RACE {
    timer := time.NewTimer(strategyTimeout)
@istae (Member) commented Jan 9, 2024:

max time allowed to fetch all the chunks from the network is 500 ms?
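For context, a simplified, self-contained sketch of the per-strategy timer pattern from the snippet above. The Strategy values and the 500 ms strategyTimeout are assumptions lifted from this conversation, not the PR's actual declarations:

// Simplified sketch: strategies below RACE get at most strategyTimeout to
// produce enough chunks before the caller may fall back to the next strategy.
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

type Strategy int

const (
    NONE Strategy = iota // assumed ordering, for illustration only
    DATA
    PROX
    RACE
)

const strategyTimeout = 500 * time.Millisecond // the 500 ms questioned above

func runWithTimeout(ctx context.Context, s Strategy, ready <-chan struct{}) error {
    var stop <-chan time.Time
    if s < RACE { // RACE launches all requests at once, so no extra cap here
        timer := time.NewTimer(strategyTimeout)
        defer timer.Stop()
        stop = timer.C
    }
    select {
    case <-ready: // enough chunks retrieved under this strategy
        return nil
    case <-stop: // strategy timed out; the caller may try the next one
        return errors.New("strategy timed out")
    case <-ctx.Done():
        return ctx.Err()
    }
}

func main() {
    ready := make(chan struct{}) // never closed: simulates a strategy that stalls
    fmt.Println(runWithTimeout(context.Background(), DATA, ready))
}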

(Resolved review threads on pkg/file/redundancy/getter/strategies.go and pkg/file/redundancy/getter/getter.go.)
continue
default:
}
_ = g.fly(i, true) // commit (RS) or will commit to retrieve the chunk
Member:

why does missing call set the chunks in flight?

// Get will call parities and other sibling chunks if the chunk address cannot be retrieved
// assumes it is called for data shards only
func (g *decoder) Get(ctx context.Context, addr swarm.Address) (swarm.Chunk, error) {
    i, ok := g.cache[addr.ByteString()]
Member:

is this cache even necessary?
Would the Get call ever receive an addr that is not part of this decoder?

Member:

haha, so the addr that Get receives is the address of a chunk within the scope of its parent (packed address chunk = intermediate chunk).
The decoders for every parent scope are cached in the joiner.
This cache here is actually an index mapping addresses (children of the parent) to positions.
It should probably be renamed.
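A minimal, self-contained sketch of the address-to-position index described above; the names are illustrative stand-ins, not the PR's actual fields:

// Sketch only: in the PR the keys are sibling chunk addresses
// (swarm.Address.ByteString()) within one parent's scope.
package main

import "fmt"

func newIndex(addrs []string) map[string]int {
    idx := make(map[string]int, len(addrs))
    for i, a := range addrs {
        idx[a] = i // shard position of this child within the parent
    }
    return idx
}

func main() {
    idx := newIndex([]string{"addr-0", "addr-1", "addr-2"})
    fmt.Println(idx["addr-1"]) // 1: the position the decoder looks up in Get
}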

if !ok {
    return nil, storage.ErrNotFound
}
if g.fly(i, true) {
Member:

the prefetch call essentially sets all of these chunks to inflight, no?
Why is this fly check necessary? Why can't we jump to the select below?

Member:

because we want singleflight behaviour on fetching, i.e. if prefetching fetches a chunk and it is then queried with joiner Get --> decoder Get (or the other way round), then we should just wait on the inflight fetch.
Similarly, when prefetch has fetched shardCnt chunks, the other chunks can be put to inflight, so that if they are `Get-ed` by the joiner they are not fetched but wait until recovered.

Member:

And of course there is not always a prefetch on every data chunk (NONE and PROX will not prefetch some chunks).
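A minimal sketch of the single-flight commit described above, assuming one atomic flag per chunk position; the real fly in this PR takes extra arguments and does more bookkeeping:

// Sketch only: fly returns true for the first caller that commits position i,
// so a chunk is fetched (or recovered) at most once; later callers just wait.
package main

import (
    "fmt"
    "sync/atomic"
)

type inflight []atomic.Bool

func (f inflight) fly(i int) bool {
    return f[i].CompareAndSwap(false, true)
}

func main() {
    f := make(inflight, 4)
    fmt.Println(f.fly(2)) // true: this caller fetches chunk 2
    fmt.Println(f.fly(2)) // false: already in flight, wait for the result
}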

// if all chunks are retrieved, signal ready
n := g.fetchedCnt.Add(1)
if n == int32(g.shardCnt) {
    close(g.ready) // signal that just enough chunks are retrieved for decoding
Member:

are we allowed to close ready if the shardCnt includes parity chunks?

Member:

surely
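For the record, the Reed-Solomon property behind this answer: with shardCnt data shards and parityCnt parities, any shardCnt of the shardCnt+parityCnt chunks are enough to reconstruct the data, so it is fine for parity chunks to count toward the ready threshold. A small illustration using the github.com/klauspost/reedsolomon library, purely as an example:

// Illustration only: lose two data shards, keep shardCnt chunks in total
// (two data + two parity), and reconstruction still succeeds.
package main

import (
    "fmt"

    "github.com/klauspost/reedsolomon"
)

func main() {
    const shardCnt, parityCnt = 4, 2
    enc, err := reedsolomon.New(shardCnt, parityCnt)
    if err != nil {
        panic(err)
    }
    // Four data shards plus two pre-allocated parity shards, all the same size.
    shards := make([][]byte, shardCnt+parityCnt)
    for i := 0; i < shardCnt; i++ {
        shards[i] = []byte{byte(i), byte(i * 10)}
    }
    for i := shardCnt; i < shardCnt+parityCnt; i++ {
        shards[i] = make([]byte, 2)
    }
    if err := enc.Encode(shards); err != nil { // fill the parity shards
        panic(err)
    }
    shards[0], shards[1] = nil, nil // lose two data shards
    if err := enc.Reconstruct(shards); err != nil {
        panic(err)
    }
    fmt.Println(shards[0], shards[1]) // the original data shards are back
}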


// Get will call parities and other sibling chunks if the chunk address cannot be retrieved
// assumes it is called for data shards only
func (g *decoder) Get(ctx context.Context, addr swarm.Address) (swarm.Chunk, error) {
@istae (Member) commented Jan 10, 2024:

what this Get should ideally do is wait for either the chunk channel g.waits[i] to finish OR for a signal that the recovery has finished (and the chunk is available).
There should also be an error channel that returns when a chunk is not fetchable AND unrecoverable, which would translate into a storage.ErrNotFound.

As it stands, this Get never properly returns an error, and simply waits for a context timeout in the case that recovery and/or fetch failed.
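A rough sketch of the suggested behaviour, not the PR's code: Get waits on the chunk's own fetch, on a recovery signal, on a terminal-failure channel, or on the context, instead of only running into a context timeout. The recovered, failed and data fields are assumptions:

// Sketch of the suggested Get; the decoder fields used here are illustrative
// stand-ins for whatever the implementation ends up using.
func (g *decoder) Get(ctx context.Context, addr swarm.Address) (swarm.Chunk, error) {
    i, ok := g.cache[addr.ByteString()]
    if !ok {
        return nil, storage.ErrNotFound
    }
    select {
    case <-g.waits[i]: // this chunk's own fetch completed
    case <-g.recovered: // recovery finished and the chunk is now available
    case <-g.failed: // not fetchable AND not recoverable
        return nil, storage.ErrNotFound
    case <-ctx.Done():
        return nil, ctx.Err()
    }
    return swarm.NewChunk(addr, g.data[i]), nil
}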

Co-authored-by: nugaon <[email protected]>
Co-authored-by: Anatol <[email protected]>
Co-authored-by: dysordys <[email protected]>
Co-authored-by: Gyorgy Barabas <[email protected]>
acha-bill previously approved these changes Feb 7, 2024
@acha-bill acha-bill dismissed their stale review February 7, 2024 12:00

will review after conflicts are resolved

@istae istae merged commit 0ece898 into master Feb 8, 2024
12 checks passed
@istae istae deleted the feat/redundancy branch February 8, 2024 10:15
istae added a commit that referenced this pull request Feb 8, 2024